Just days after the first SARS-CoV-2 cases were identified in the United States, the degree to which different states were affected became apparent. The USNS hospital ship Comfort arrived at Pier 88 of New York Harbor on Monday, March 30, while life in most of the rural US remained unaffected. This difference in impact continued through the summer and into the fall, as state and local governments took an array of approaches to balancing the public health impact of COVID-19 with the economic impact.
In the following analysis, I will attempt to show the varying degrees of impact COVID-19 has had on six states: North Carolina, Maryland, California, New York, Texas, and Michigan.
These states represent a variety of demographics and geographies. By analyzing their death tolls, along with attributes such as geography and population density, I hope to identify causal relationships that will be useful in building a predictive model.
Pandas and GeoPandas will be used for data manipulation, Bokeh for interactive plotting, Folium for interactive maps, and scikit-learn for ML model creation.
#!pip install jupyterthemes folium geopandas
!jt -t gruvboxd -T -N -cellw 70%
from IPython.core.display import display, HTML
display(HTML("<style>.container { width:80% !important; margin-left:15% !important; }</style>"))
import folium
import requests
import pandas as pd
import geojson
import json
import numpy as np
import geopandas as gpd
from datetime import datetime
from sklearn import preprocessing
import matplotlib.pyplot as plt
import seaborn as sns
import bokeh.plotting as bkplt
from bokeh.io import output_notebook, show
from bokeh.resources import INLINE
from bokeh.layouts import gridplot as grid, row, column
from bokeh.models import LinearAxis, Range1d
import censusdata
from bokeh.themes import built_in_themes, Theme
from bokeh.io import curdoc
from bokeh import palettes
#curdoc().theme = 'dark_minimal'
plt.style.use('fivethirtyeight')
sns.set_context('poster')
bkplt.output_notebook()
colors = palettes.Set3[6]
states = ['North Carolina', 'Maryland', 'California', 'New York', 'Texas', 'Michigan']
populations = [10385000, 6045000, 39500000, 19450000, 29000000, 10000000]
scaler = preprocessing.MinMaxScaler(feature_range=(0, .9))
with open('customTheme.json') as f:
    d = json.load(f)
curdoc().theme = Theme(json = d)
COVID-19 data is sourced from the Novel Corona Virus 2019 Dataset on Kaggle, GeoJSON county-level files come from Opendatasoft, and US Census data is pulled from the CensusData API.
First, using pandas, we read the COVID-19 dataset into a DataFrame, pandas' own two-dimensional data structure.
cov19df = pd.read_csv("time_series_covid_19_deaths_US.csv")
cov19df.iloc[-2:]
Next, using GeoPandas, we read the GeoJSON files downloaded from ODS into a GeoDataFrame. This is nothing but a pandas DataFrame with defined columns for geometry, but using the GeoPandas library allows easy conversion from GeoJSON into a DataFrame. You may notice the line:
mdGdf.at[15, 'name'] = 'Baltimore City'
This is an example of data pre-processing: since Baltimore City and Baltimore County were both marked as 'Baltimore' in the name column of our GeoDataFrame, later merges with the COVID-19 dataset (which contains both a 'Baltimore' and a 'Baltimore City') would be unsuccessful or incomplete if this were not resolved. Manually fixing data can be a painstaking yet necessary part of data science.
nc_geo = "ncCounties.geojson"
tx_geo = "txCounties.geojson"
ny_geo = "nyCounties.geojson"
ca_geo = "caCounties.geojson"
md_geo = 'mdCounties.geojson'
mi_geo = 'miCounties.geojson'
ncGdf = gpd.read_file(nc_geo)
txGdf = gpd.read_file(tx_geo)
nyGdf = gpd.read_file(ny_geo)
caGdf = gpd.read_file(ca_geo)
mdGdf = gpd.read_file(md_geo)
miGdf = gpd.read_file(mi_geo)
mdGdf[mdGdf['name'] == 'Baltimore']
mdGdf.at[15, 'name'] = 'Baltimore City'
mdGdf[mdGdf['name'] == 'Baltimore']
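To see why the duplicate name would have caused trouble, here is a minimal sketch (with made-up county names and death counts, not the real data) of what an inner merge does when two rows share the key 'Baltimore':

```python
import pandas as pd

# Hypothetical miniature of the problem: two geo rows both named
# 'Baltimore', while the COVID data distinguishes 'Baltimore City'.
geo = pd.DataFrame({'name': ['Baltimore', 'Baltimore', 'Howard']})
covid = pd.DataFrame({'name': ['Baltimore', 'Baltimore City', 'Howard'],
                      'deaths': [10, 20, 5]})

merged = geo.merge(covid, on='name', how='inner')
# Both geo 'Baltimore' rows match the single covid 'Baltimore' row,
# and 'Baltimore City' is silently dropped from the result.
print(merged)
```

Renaming one of the geo rows to 'Baltimore City', as done above, gives every row a unique key so the merge is one-to-one.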
Last, we will use the CensusData API to collect population figures for each county in our selected states; these numbers will also be stored in a DataFrame. To use the CensusData API we need the ANSI FIPS codes for our states, which are conveniently located in the GeoDataFrames we just produced.
def censusData(stateCode):
    censusDf = censusdata.download('acs5', 2015, censusdata.censusgeo(
        [('state', stateCode), ('county', '*')]), ['B01001_001E'])
    censusDf = censusDf.reset_index()
    censusDf['name'] = censusDf.apply(lambda x: str(x['index']).split(" Count")[0], axis = 1)
    censusDf = censusDf.rename(columns = {'B01001_001E': 'pop'}).drop(['index'], axis = 1)
    return censusDf
ncCensus = censusData(ncGdf.iloc[0]['statefp'])
mdCensus = censusData(mdGdf.iloc[0]['statefp'])
txCensus = censusData(txGdf.iloc[0]['statefp'])
nyCensus = censusData(nyGdf.iloc[0]['statefp'])
caCensus = censusData(caGdf.iloc[0]['statefp'])
miCensus = censusData(miGdf.iloc[0]['statefp'])
ncCensus.iloc[-2:]
To make things easier, let's gather our DataFrames in a single data structure.
frames = {"Maryland":{"census": mdCensus, "gdf": mdGdf},
"North Carolina":{"census": ncCensus, "gdf": ncGdf},
"Michigan":{"census": miCensus, "gdf": miGdf},
"Texas":{"census": txCensus, "gdf": txGdf},
"New York":{"census": nyCensus, "gdf": nyGdf},
"California":{"census": caCensus, "gdf": caGdf}}
Now we have our geographic and census DataFrames stored together, but our COVID-19 dataset is still in an unusable form. Before we separate it into our different states, we need to make it tidy. Currently the DataFrame has one row for every county and a column for every day, along with other attribute columns. To make this data tidy, we need to make every observation (every day, for every county) into its own row. To do this we use 'melt', a pandas function that unpivots a DataFrame. We do this for every state and then add these frames to our dictionary of frames.
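As a minimal sketch of what 'melt' does, consider a tiny wide-format frame (the county names and counts here are made up for illustration):

```python
import pandas as pd

# Toy wide frame: one row per county, one column per date
wide = pd.DataFrame({'name': ['Wake', 'Durham'],
                     '3/1/20': [0, 1],
                     '3/2/20': [2, 3]})

# melt unpivots the date columns into one (name, date, deaths) row
# per county per day
tidy = pd.melt(wide, id_vars='name', value_vars=['3/1/20', '3/2/20'],
               var_name='date', value_name='deaths')
print(tidy)
```

Two counties over two days become four rows, one per observation, which is exactly the tidy shape we want.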
for f in frames:
    # select the state's covid data
    state_covid_data = cov19df[cov19df['Province_State'] == f]
    # rename the county column, then melt the date columns into rows
    state_covid_data = state_covid_data.rename(columns = {"Admin2": "name"})
    state_covid_data = pd.melt(state_covid_data, id_vars = 'name', value_vars = list(state_covid_data.columns[12:332].values),
                               var_name = 'date', value_name = 'deaths')
    frames[f]['covid'] = state_covid_data
Now that we have all of our data, we could get straight to plotting and analysis, but to make our lives easier we should join these frames. We will use 'merge', the pandas function analogous to SQL joins. Since our rows will now all contain state data as well, we can drop the dictionary holding our frames and append them together.
df = pd.DataFrame()
for f in frames:
    merged = (frames[f]['covid']
              .merge(frames[f]['census'], on = 'name', how = 'inner')
              .merge(frames[f]['gdf'], on = 'name', how = 'inner'))
    df = df.append(merged)
df.iloc[-2:]
Date is currently represented as a string, which isn't optimal; let's convert it now to datetime.
df['datetime'] = df.apply(lambda x: datetime.strptime(x['date'], '%m/%d/%y'), axis = 1)
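As an aside, pandas also offers a vectorized alternative, 'pd.to_datetime', which parses the whole column at once rather than applying 'strptime' row by row. A small sketch with sample strings in the same '%m/%d/%y' format the dataset uses:

```python
import pandas as pd

# Sample date strings in the dataset's %m/%d/%y format
s = pd.Series(['3/1/20', '11/15/20'])

# Parse the entire Series in one vectorized call
parsed = pd.to_datetime(s, format='%m/%d/%y')
print(parsed)
```

On a frame of this size either approach works, but the vectorized call is noticeably faster on large columns.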
Since population density will be of use to us later, let's go ahead and calculate it now. The column 'aland' is the land area in $m^{2}$ for each county, originally obtained from our GeoJSON files. I'll be using $km^{2}$, so you'll notice a factor of $1000^{2} = 1000000$ coming into play. We also want state death totals for each day, so I'll compute those as well using a groupby operation, which, like merge, is analogous to the same SQL statement.
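A minimal sketch of the groupby-then-sum pattern, with made-up state names and counts:

```python
import pandas as pd

# Toy county-level rows; summing within each state collapses the
# counties into one total row per state
toy = pd.DataFrame({'state_name': ['NC', 'NC', 'MD'],
                    'deaths': [5, 7, 3]})

totals = toy.groupby(by='state_name').sum().reset_index()
print(totals)
```

The reset_index call turns the group keys back into an ordinary column, which keeps later merges and plotting code simple.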
dfTotals = df.groupby(by = ['datetime', 'state_name']).sum().reset_index()
df = df.reset_index(drop = True)
df['pDensity'] = df.apply(lambda x: (x['pop']/x['aland'])*1000000, axis = 1)
dfTotals['pDensity'] = dfTotals.apply(lambda x: (x['pop']/x['aland'])*1000000, axis = 1)
display(df.iloc[-2:], dfTotals.iloc[-2:])
Now that we have county- and state-level DataFrames of tidy data, we can move on to analysis.
I'll be primarily using Bokeh for visualization, since Bokeh allows interaction once exported to HTML, which matplotlib/seaborn do not, at least not in an easy-to-apply fashion. To get a feel for some trends visually, I'll plot three things for each state: cumulative deaths, cumulative deaths per capita, and cumulative deaths per population density.
To plot efficiently, I'll zip two lists, one of the names of my states, and another of my chosen colors from the palette I defined during my environment setup.
states_colors = zip(states, colors)
f1 = bkplt.figure(plot_width = 1200, plot_height = 600, x_axis_type = 'datetime', x_axis_label = 'Date', y_axis_label = 'Deaths',
title ="Cumulative Deaths for Selected States", x_range = (df.iloc[40]['datetime'], df.iloc[-1]['datetime']),sizing_mode="stretch_both")
f2 = bkplt.figure(plot_width = 1200, plot_height = 600, x_axis_type = 'datetime', x_axis_label = 'Date', y_axis_label = 'Deaths/Capita',
title ="Cumulative Deaths Per Capita for Selected States", x_range = (df.iloc[40]['datetime'], df.iloc[-1]['datetime']))
f3 = bkplt.figure(plot_width = 1200, plot_height = 600, x_axis_type = 'datetime', x_axis_label = 'Date', y_axis_label = 'Deaths/Population Density',
title ="Cumulative Deaths Per Population Density for Selected States", x_range = (df.iloc[40]['datetime'], df.iloc[-1]['datetime']))
for state, color in states_colors:
    df1 = dfTotals[dfTotals['state_name'] == state].copy()
    df1['var2'] = df1.apply(lambda x: x['deaths']/x['pop'], axis = 1)
    df1['var3'] = df1.apply(lambda x: x['deaths']/x['pDensity'], axis = 1)
    f1.line(df1['datetime'], df1['deaths'], line_width = 6, alpha = .8, color = color, legend_label = state)
    f2.line(df1['datetime'], df1['var2'], line_width = 6, alpha = .8, color = color, legend_label = state)
    f3.line(df1['datetime'], df1['var3'], line_width = 6, alpha = .8, color = color, legend_label = state)
for f in [f1, f2, f3]:
    f.legend.location = "top_left"
    f.legend.title = "Click To Hide"
    f.legend.title_text_color = "White"
    f.legend.click_policy = "hide"
show(column(f1,f2,f3, sizing_mode = 'stretch_width'))
The extremely fast rise of deaths in New York immediately pops out at us, on both an absolute and a per capita basis. Viewing certain pairs of states also makes for interesting comparisons. Try hiding every state besides California, Texas, and Michigan in the per capita plot. We see an initial surge from Michigan, while Texas and California remain tightly paired. As the weather heats up, though, deaths in Texas rise dramatically. I would guess this is due to more people congregating inside during the summer months, compared to the more comfortable summers of CA and MI. This trend reverses as the weather cools off, and Michigan deaths begin to rise dramatically. I would assume this is a similar phenomenon: as the weather becomes unbearable, so do outside gatherings, and more people are bound to catch COVID-19 inside.

When controlling for population density in the third graph, the problem with comparing states to each other becomes apparent. Texas deaths skyrocket, but this is almost certainly due to the many largely uninhabited counties with extremely large land areas, skewing the state's population density far lower than the density at which the average Texan actually lives. It is therefore clearly necessary to do further exploratory data analysis at the county level before attempting to build and train a model.
To get a better understanding of what's going on at the county level, I'll make two plots for each state, one displaying deaths per capita for each county, and one displaying deaths while controlling for population density, like done in the last two statewide plots.
plots = []
states_colors = zip(states, colors)
for state, color in states_colors:
    f2 = bkplt.figure(plot_width = 800, plot_height = 500, x_axis_type = 'datetime', x_axis_label = 'Date', y_axis_label = 'Deaths/Capita',
                      title = "Cumulative Deaths Per Capita for Counties in " + state, x_range = (df.iloc[40]['datetime'], df.iloc[-1]['datetime']))
    f3 = bkplt.figure(plot_width = 800, plot_height = 500, x_axis_type = 'datetime', x_axis_label = 'Date', y_axis_label = 'Deaths/Population Density',
                      title = "Cumulative Deaths Per Population Density for Counties in " + state, x_range = (df.iloc[40]['datetime'], df.iloc[-1]['datetime']))
    df1 = df[df['state_name'] == state].copy()
    df1['var2'] = df1.apply(lambda x: x['deaths']/x['pop'], axis = 1)
    df1['var3'] = df1.apply(lambda x: x['deaths']/x['pDensity'], axis = 1)
    df1 = df1.sort_values(by = 'datetime')
    county_count = len(df1['name'].unique())
    # the last county_count rows hold the final date; take the five highest death totals
    top5 = list(df1[-county_count:].sort_values(by = 'deaths')['name'][-5:])
    top5.reverse()
    for s in df1['name'].unique():
        df2 = df1[df1['name'] == s].copy()
        if s in top5:
            f2.line(df2['datetime'], df2['var2'], alpha = .5, color = color, line_width = 3, muted_alpha = 1, legend_label = s, muted_line_width = 6)
            f3.line(df2['datetime'], df2['var3'], alpha = .5, color = color, line_width = 3, muted_alpha = 1, legend_label = s, muted_line_width = 6)
            continue
        f2.line(df2['datetime'], df2['var2'], alpha = .1, color = color, line_width = 2)
        f3.line(df2['datetime'], df2['var3'], alpha = .1, color = color, line_width = 2)
    for f in [f2, f3]:
        f.legend.location = "top_left"
        f.legend.title = "Top 5 " + state + " Counties (by deaths): Click To Highlight"
        f.legend.title_text_color = "White"
        f.legend.click_policy = "mute"
    plots.append([f2, f3])
show(grid(plots, sizing_mode = 'stretch_width'))